The following analysis focus on RNA-seq count data extracted from different cancer datasets from the Cancer Genome Atlas (TCGA). From the original TCGA data 50 cases (tumor samples) and 50 controls (normal samples) were randomly selected.
The packages needed for the analysis are the following:
library(sessioninfo)
library(tidyverse)
library(biomaRt)
library(MotifDb) # large collection of motifs across different species
library(seqLogo)
library(PWMEnrich)
library(PWMEnrich.Hsapiens.background)
library(GEOquery)
library(oligo)
library(pd.hg.u133.plus.2)
library(hgu133plus2.db)
library(genefilter)
library(limma)
library(pheatmap)
library(stringr)
library(GenomicFeatures) # to build object with specific information such as genomic coordinates
library(ggplot2)
library(edgeR) # designed to perform Differential Expression Analysis from RNA-Seq data
library(fgsea) # computation of enrichment scores and other statistics
library(org.Hs.eg.db) # database annotation human specific transcript from Ensembl
library(clusterProfiler) # uses Fisher test, explore gene list of interest against a reference
library(enrichplot) # build enrich map
library(ggnewscale)
library(DOSE)
library(pathview) # contains cartoons of pathways and we project on them our genes of interest
library(igraph)
In particular, we consider data coming from Lung adenocarcinoma.
load("./RData/Lung_adenocarcinoma.RData")
After loading the data, the following three data-frames are available:
raw_counts_df which contains the raw RNA-seq
counts;
c_anno_df, which contains sample name and condition
(case and control);
r_anno_df, which contains the ENSEMBL genes ids, the
length of the genes and the genes symbols.
To extract only protein coding genes from raw_counts_df
and r_anno_df, we use the biomaRt package.
First, we retrieve the information about the protein coding genes from Ensembl.
database <- useMart("ensembl")
datasetHuman <- useDataset("hsapiens_gene_ensembl", mart = database)
query <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "gene_biotype"),
filters = c("ensembl_gene_id"), values = r_anno_df$ensembl_gene_id, mart = datasetHuman)
query_protein_coding <- query[which(query$gene_biotype == "protein_coding"), ]
Then, we filter the data frames containing the raw counts and the annotation to keep only the protein coding genes.
indexes_r_anno_df <- which(r_anno_df$ensembl_gene_id %in% query_protein_coding$ensembl_gene_id)
r_anno_df_protein_coding <- r_anno_df[indexes_r_anno_df, ]
indexes_raw_counts_df <- which(rownames(raw_counts_df) %in% query_protein_coding$ensembl_gene_id)
raw_counts_df_protein_coding <- raw_counts_df[indexes_raw_counts_df, ]
To perform the differential expression analysis, we will use the
edgeR package.
It is important to remove genes with low signal that will have low statistical power. Since, we want to focus on the transcripts we can use to perform the analysis, we can filter raw counts data using a threshold of raw count > 20 in at least 5 replicates.
# count threshold
count_thr <- 20
# number of replicates with more counts than the count threshold
repl_thr <- 5
filter_vec <- apply(raw_counts_df_protein_coding,1, # go through all count matrices by rows
function(y) max(by(y, c_anno_df$condition, function(x) sum(x>=count_thr))))
# groups the values on each condition and sum all the values above the count
Then, we filter the previously updated data frames.
filter_counts_df <- raw_counts_df_protein_coding[filter_vec >= repl_thr, ]
dim(filter_counts_df) # 17289
## [1] 17289 100
filter_anno_df <- r_anno_df_protein_coding[rownames(filter_counts_df), ]
dim(filter_anno_df) # 17289
## [1] 17289 3
Now we check the library size of each sample (how many reads we have sequenced for each experiment)
size_df <- data.frame(sample = colnames(filter_counts_df), read_millions = colSums(filter_counts_df)/1e+06)
ggplot(data = size_df, aes(sample, read_millions)) + geom_bar(stat = "identity",
fill = "indianred", colour = "indianred", width = 0.7, alpha = 0.7) + coord_flip() +
theme_bw()
Library size of each sample.
Then, we visualize a boxplot of gene counts.
long_counts_df <- gather(as.data.frame(filter_counts_df), key = "sample", value = "read_number")
ggplot(data = long_counts_df, aes(sample, read_number + 1)) + geom_boxplot(colour = "deeppink4",
fill = "deeppink4", alpha = 0.7) + theme_bw() + scale_y_log10()
Boxplot of gene counts before normalization.
As we can see from the plots, there is a significant variability across samples in term of library sizes (reads per million). We have to take into account this aspect because the expected size of each count is the product of the relative abundance of that gene in that sample but also of the library size.
As we can see from the boxplot, we need to normalize our data before testing for differential expression. Normalization can be obtained using different methodologies. Among them, TMM (the default method) is a method that consider in the normalization also variables related to the library size.
To perform the DEG analysis, we first need to create a DGRList object
containing information about counts, annotation, samples and genes. To
do that, we use the function DGEList to create the input
for the following normalization step. This object contains information
about counts, samples and genes.
The normalization intra- and inter-sample is done using the function
calcNormFactors and specifying method = "TMM",
which is the weighted trimmed mean of M-values approach.
edge_c <- DGEList(counts = filter_counts_df, group = c_anno_df$condition, samples = c_anno_df,
genes = filter_anno_df)
# computing norm factors for TMM normalization
edge_n <- calcNormFactors(edge_c, method = "TMM")
Then, we create a cpm table containing the normalized expression values for each transcript expressed as counts per million (CPM).
# create a cpm table (normalized expression values)
cpm_table <- as.data.frame(round(cpm(edge_n), 2))
To see the effect of the normalization, we look at the boxplot distribution of gene expression signals after normalization.
# look at the boxplot distribution of gene expression signals after
# normalization
long_cpm_df <- gather(cpm_table, key = "sample", value = "CPM")
ggplot(data = long_cpm_df, aes(sample, CPM + 1)) + geom_boxplot(colour = "olivedrab",
fill = "olivedrab", alpha = 0.7) + theme_bw() + scale_y_log10()
Boxplot distribution of gene expression signals after normalization.
We notice that, with respect to the previous boxplot, after the normalization, the distributions are comparable. That means that now our data is ready to be tested for DE analysis.
We define the experimental design matrix, later needed to calculate the dispersion and fit. This matrix is based on the experimental design, so we define the conditions we want to test (case VS control).
design <- model.matrix(~0 + group, data = edge_n$samples)
colnames(design) <- levels(edge_n$samples$group)
rownames(design) <- edge_n$samples$sample
Once we normalized the data and the design, we proceed by calculating the dispersion fit.
# calculate dispersion and fit with edgeR (necessary for differential
# expression analysis)
edge_d <- estimateDisp(edge_n, design)
edge_f <- glmQLFit(edge_d, design)
The estimateDisp function tries to estimate the
variability at different levels: in the data, inter sample and intra
sample. The over dispersion of counts across the samples can be modeled
as a Poisson distribution, which can be approximated using a negative
binomial distribution by using glmQLFit function.
We define the contrasts (conditions to be compared).
contro <- makeContrasts("control-case", levels = design)
We performed a test through function glmQLFTest to
determine which genes are differentially expressed. Then, we selected
genes based on a 0.01 p-value cutoff and ordered them based on the log2
fold change.
# fit the model with generalized linear models
edge_t <- glmQLFTest(edge_f, contrast = contro)
DEGs <- as.data.frame(topTags(edge_t, n = 20000, p.value = 0.01, sort.by = "logFC"))
Then, we improve the selection, considering not only the p-value, but also the average expression of the genes (logCPM). We used the logFC value to assign genes to different classes:
up-regulated genes if \(logFC>1.5\)
down-regulated genes if \(logFC<-1.5\)
not significant genes (unchanged) otherwise.
DEGs$class <- "="
DEGs$class[which(DEGs$logCPM > 1 & DEGs$logFC > 1.5 & DEGs$FDR < 0.25)] = "+" # upregulated genes
DEGs$class[which(DEGs$logCPM > 1 & DEGs$logFC < (-1.5) & DEGs$FDR < 0.25)] = "-" # downregulated genes
DEGs <- DEGs[order(DEGs$logFC, decreasing = T), ]
head(DEGs)
table(DEGs$class) # 1193-, 877+, 14949=
##
## - + =
## 1199 884 9258
From the results, we saw that 1193 genes are down-regulated, 877 genes are up-regulated and 14949 genes have no changes in their regulation.
We then create the volcano plot.
input_df <- DEGs
xlabel <- "log2 FC case vs control"
ylabel <- "-log10 p-value"
par(fig = c(0, 1, 0, 1), mar = c(4, 4, 1, 2), mgp = c(2, 0.75, 0))
plot(input_df$logFC, -log(input_df$PValue, base = 10), xlab = xlabel, ylab = ylabel,
col = ifelse(input_df$class == "=", "grey70", "olivedrab4"), pch = 20, frame.plot = TRUE,
cex = 0.8, main = "Volcano plot")
abline(v = 0, lty = 2, col = "grey20")
Volcano plot
The volcano plot allows to have a quick visual identification of genes with large fold changes that are also statistically significant.
We also report the heatmap of only up and down-regulated genes.
To create an annotated heatmap focusing only on up- and
down-regulated genes we need first of all a matrix in which we select
genes with class “+” or “-. Moreover, we also need to assign a color to
each sample. In this case, we assign green to the case condition and
beige to the control, and save the corresponding color into the variable
cols. In this way, we can then set the
ColSideColors parameter in order to have a color code to
recognize the sample condition.
# plot only up and down regulated genes
cols <- c(ifelse(c_anno_df$condition == "case", "chartreuse4", "burlywood3"))
pal <- c("blue", "white", "red")
pal <- colorRampPalette(pal)(100)
heatmap(as.matrix(cpm_table[which(rownames(cpm_table) %in% DEGs$ensembl_gene_id[which(DEGs$class !=
"=")]), ]), ColSideColors = cols, cexCol = 0.5, margins = c(4, 4), col = pal,
cexRow = 0.2)
Heatmap up and down-regulated genes.
On the top of the heatmap, we can see the dendrogram indicating how distant our samples are, while on the left the dendrogram related to genes.
To simplify the later analysis, we save the list of differentially expressed genes in a text file.
up_DEGs <- DEGs[which(DEGs$class == "+"), ]
down_DEGs <- DEGs[which(DEGs$class == "-"), ]
write.table(up_DEGs, file = "output/up_DEGs.txt", row.names = F, col.names = T, sep = "\t",
quote = F)
write.table(down_DEGs, file = "output/down_DEGs.txt", row.names = F, col.names = T,
sep = "\t", quote = F)
write.table(DEGs, file = "output/DEGs.txt", row.names = F, col.names = T, sep = "\t",
quote = F)
To perform the gene set enrichment analysis, we will use the
clusterProfiler package.
DEGs <- read.table("output/DEGs.txt", header = T, sep = "\t", as.is = T)
table(DEGs$class)
##
## - + =
## 1199 884 9258
We use the biomaRt package to retrieve the
entrezgene_id for all the genes in the DEGs
dataset.
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
convert <- getBM(attributes = c("ensembl_gene_id", "entrezgene_id"), filters = c("ensembl_gene_id"),
values = DEGs$ensembl_gene_id, mart = ensembl)
DEGs <- merge(DEGs, convert, by.x = "ensembl_gene_id", by.y = "ensembl_gene_id") # include the new info in the original data frame
DEGs <- DEGs[which(!is.na(DEGs$entrezgene_id)), ]
DEGs <- DEGs[-which(duplicated(DEGs$entrezgene_id)), ]
We removed all the NA and duplicates in the dataset DEGs.
We created new lists for the down and up-regulated genes.
# list of up-regulated genes
upDEGs <- DEGs %>%
filter(class == "+")
# list of down-regulated genes
downDEGs <- DEGs %>%
filter(class == "-")
# biological process up regulated genes
ego_BP_up <- enrichGO(gene = upDEGs$external_gene_name, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05)
# biological process down regulated genes
ego_BP_down <- enrichGO(gene = downDEGs$external_gene_name, OrgDb = org.Hs.eg.db,
keyType = "SYMBOL", ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05)
We report the top 10 enriched GO terms related to the Biological
Process for both up and down regulated genes.
In the barplots we can see that the elements are ordered by adjusted
p-value (where the most significant is placed on the top) and on the
x-axis we have the gene counts, so the number of elements of our lists
were found in the category.
barplot(ego_BP_up, showCategory = 10, main = "Up-regulated gene list: top 10 enriched BP terms")
Biological process up regulated genes
Both the first two biological processes are related to the microtubules, which might be explained by the fact that tumour cells have a high growth rate thus the cytoskeleton is assembled over and over again. We also notice that angiogenesis is among the most enriched biological processes as we expected. In fact, angiogenesis is particularly important in tumorigenesis to allow the growth of the tumour tissue, providing nutrients to cells.
barplot(ego_BP_down, showCategory = 10, main = "Down-regulated gene list: top 10 enriched BP terms")
Biological process down regulated genes
We would expect most of these biological processes to be enriched in the up-regulated genes since tumour cells tend to replicate more than normal cells. A possible explaination is that tumour cells genes accumulate numerous mutations usually ending up in loss of function or downregulation. These mutations might occur on transcription factors binding sites or RNA binding sites, reducing their expression.
The same analysis was done also for the molecular function.
# molecular function up regulated genes
ego_MF_up <- enrichGO(gene = upDEGs$external_gene_name, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
ont = "MF", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05)
# molecular function down regulated genes
ego_MF_down <- enrichGO(gene = downDEGs$external_gene_name, OrgDb = org.Hs.eg.db,
keyType = "SYMBOL", ont = "MF", pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05)
barplot(ego_MF_up, showCategory = 10)
Molecular function up regulated genes
barplot(ego_MF_down, showCategory = 10)
Molecular function down regulated genes
eWP_up <- enrichWP(gene = upDEGs$entrezgene_id, organism = "Homo sapiens", pvalueCutoff = 0.05,
qvalueCutoff = 0.1)
head(eWP_up, n = 10)
eWP_down <- enrichWP(gene = downDEGs$entrezgene_id, organism = "Homo sapiens", pvalueCutoff = 0.05,
qvalueCutoff = 0.1)
head(eWP_down, n = 10)
logFC <- upDEGs$logFC
names(logFC) <- upDEGs$entrezgene_id
pathview(gene.data = logFC, pathway.id = "WP2446", species = "human")
eWP_KEGG <- enrichKEGG(gene = upDEGs$entrezgene_id, organism = "human", pvalueCutoff = 0.05,
qvalueCutoff = 0.1)
head(eWP_KEGG, n = 20)
promoter_regions <- getSequence(id = downDEGs$ensembl_gene_id, type = "ensembl_gene_id",
seqType = "gene_flank", upstream = 500, mart = ensembl)
data(PWMLogn.hg19.MotifDb.Hsap)
seq <- lapply(promoter_regions$gene_flank, function(x) DNAString(x))
enriched_TFs = motifEnrichment(seq, PWMLogn.hg19.MotifDb.Hsap, score = "affinity")
## Calculating motif enrichment scores ...
report = groupReport(enriched_TFs)
report
## An object of class 'MotifEnrichmentReport':
## rank target id raw.score p.value
## 1 1 PGAM2 PGAM2 7.47058707343782 2.41159784654048e-86
## 2 2 TFAP4 M2944_1.02 3.42391273544219 3.36836197404242e-84
## 3 3 SP2 SP2 119.54842162525 6.25197615973972e-84
## 4 4 CEBPB M4556_1.02 2.30767167521828 1.7054813305699e-83
## 5 5 JUND M4506_1.02 6.8634137704217 1.50332672791467e-81
## 6 6 MAFK M4573_1.02 2.41935358645508 4.83938302985492e-81
## 7 7 ZMAT2 ZMAT2 2.16420212862343 9.19698805317131e-81
## 8 8 JUN M4591_1.02 2.76520576103839 2.36897498529839e-80
## 9 9 KLF5 KLF5 30.1023912345585 2.91297830264159e-79
## 10 10 NNT NNT 1.87741003326598 6.17885793377267e-79
## ... ... ... ... ... ...
## 2287 2001.5 Oct-1 NBT06/Oct-1.pwm 1.72351938773035 1
## top.motif.prop
## 1 0.177364864864865
## 2 0.178209459459459
## 3 0.185810810810811
## 4 0.174831081081081
## 5 0.185810810810811
## 6 0.171452702702703
## 7 0.163851351351351
## 8 0.168074324324324
## 9 0.158783783783784
## 10 0.173986486486486
## ... ...
## 2287 0.0278716216216216
Select one among the top enriched TFs, compute the empirical distributions of scores for all PWMs that you find in MotifDB for the selected TF and determine for all of them the distribution (log2) threshold cutoff at 99.75%.
tfs <- report$target[1]
tfs_motifs = subset(MotifDb, organism == "Hsapiens" & geneSymbol == tfs)
# tfs_motifs
# transformation to a PWM matrix
PWM = toPWM(as.list(tfs_motifs))
# PWM
ecdf = motifEcdf(PWM, organism = "hg19", quick = TRUE)
# calculate for each motif of my gene the empirical distribution scores ecdf
thresholds = lapply(ecdf, function(x) log2(quantile(x, 0.9975))) # for each of the distribution, take the quantile as reference
# thresholds
Identify which up-regulated (or down-regulated depending on the choice you made at point 7) genes have a region in their promoter (defined as previously) with binding scores above the computed thresholds for any of the previously selected PWMs. Use pattern matching as done during the course
scores = motifScores(seq, PWM, raw.scores = FALSE, cutoff = unlist(thresholds)) # in my promoters, i want to spot the regions that have higher probability to bind my transcription factors -> the function motifScores performs the pattern matching
highscore_seq <- which(apply(scores, 1, sum) > 0)
genes_id <- promoter_regions$ensembl_gene_id[highscore_seq]
# genes_id # down-regulated genes that have a region in their promoter with
# binding scores above the defined threshold for any PWMs
write(genes_id, "output/enriched_genes_id.txt")
We use STRING database to ind PPI interactions among differentially expressed genes. The network is exported in TSV format.
dat <- read.table("output/down_DEGs.txt", sep = "\t", header = TRUE)
write.table(dat$ensembl_gene_id, file = "output/ens_gene_id_down.txt", row.names = FALSE,
col.names = FALSE)
We import the network in R and, using igraph package, we
identify and plot the largest connected component.
links_high_conf <- read.delim("string_interactions_short_high_conf.tsv")
links <- read.delim("string_interactions_short.tsv")
downDEGs <- read.table("output/down_DEGs.txt", sep = "\t", header = TRUE)
ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
nodes <- getBM(attributes = c("external_gene_name", "ensembl_gene_id", "description",
"gene_biotype", "start_position", "end_position", "chromosome_name", "strand"),
filters = c("ensembl_gene_id"), values = downDEGs[, 1], mart = ensembl) # search info about nodes using ensembl_gene_id
nodes <- unique(nodes[, c(1, 3:6)])
nodes <- nodes[nodes$external_gene_name %in% links$X.node1 & nodes$external_gene_name %in%
links$X.node1, ]
links <- links[links$X.node1 %in% nodes$external_gene_name & links$node2 %in% nodes$external_gene_name,
]
# create network
net <- graph_from_data_frame(d = links, vertices = nodes, directed = FALSE)
plot(net)
c <- components(net, mode = c("weak", "strong"))
net.c <- induced_subgraph(net, V(net)[which(c$membership == 1)])
plot(net.c, edge.width = 2, vertex.color = "orange", vertex.size = 10, vertex.frame.color = "darkgray",
vertex.label.color = "black", vertex.label.cex = 0.7, edge.curved = 0.1)
deg <- degree(net.c, mode = "all")
names_nodes <- names(deg[which(deg > 150)])
nodes.deg <- nodes[nodes$external_gene_name %in% names_nodes, ]
nodes.deg <- nodes.deg[nodes.deg$external_gene_name %in% links$X.node1 & nodes.deg$external_gene_name %in%
links$node2, ]
links.deg <- links[links$X.node1 %in% nodes.deg$external_gene_name & links$node2 %in%
nodes.deg$external_gene_name, ]
net.deg <- graph_from_data_frame(d = links.deg, vertices = nodes.deg, directed = FALSE)
plot(net.deg, edge.width = 0.5, vertex.color = "orange", vertex.size = 10, vertex.frame.color = "darkgray",
vertex.label.color = "black", vertex.label.cex = 0.7, edge.curved = 0.1)
links_high_conf <- read.delim("string_interactions_short_high_conf.tsv")
downDEGs <- read.table("output/down_DEGs.txt", sep = "\t", header = TRUE)
ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
nodes <- getBM(attributes = c("external_gene_name", "ensembl_gene_id", "description",
"gene_biotype", "start_position", "end_position", "chromosome_name", "strand"),
filters = c("ensembl_gene_id"), values = downDEGs[, 1], mart = ensembl) # search info about nodes using ensembl_gene_id
nodes <- unique(nodes[, c(1, 3:6)])
nodes <- nodes[nodes$external_gene_name %in% links_high_conf$X.node1 & nodes$external_gene_name %in%
links_high_conf$X.node1, ]
links_high_conf <- links_high_conf[links_high_conf$X.node1 %in% nodes$external_gene_name &
links_high_conf$node2 %in% nodes$external_gene_name, ]
# create network
net <- graph_from_data_frame(d = links_high_conf, vertices = nodes, directed = FALSE)
plot(net)
c <- components(net, mode = "strong")
net.c <- induced_subgraph(net, V(net)[which(c$membership == 1)])
plot(net.c, edge.width = 2, vertex.color = "orange", vertex.size = 10, vertex.frame.color = "darkgray",
vertex.label.color = "black", vertex.label.cex = 0.7, edge.curved = 0.1)
deg <- degree(net.c, mode = "all")
names_nodes <- names(deg[which(deg > 110)])
nodes.deg <- nodes[nodes$external_gene_name %in% names_nodes, ]
nodes.deg <- nodes.deg[nodes.deg$external_gene_name %in% links_high_conf$X.node1 &
nodes.deg$external_gene_name %in% links_high_conf$node2, ]
links.deg.high <- links_high_conf[links_high_conf$X.node1 %in% nodes.deg$external_gene_name &
links_high_conf$node2 %in% nodes.deg$external_gene_name, ]
net.deg <- graph_from_data_frame(d = links.deg.high, vertices = nodes.deg, directed = FALSE)
plot(net.deg, edge.width = 0.5, vertex.color = "orange", vertex.size = 10, vertex.frame.color = "darkgray",
vertex.label.color = "black", vertex.label.cex = 0.7, edge.curved = 0.1)